HoPE: A Hybrid Approach to Position Embedding for Length Generalization in Vision-Language Models

Towards Robust Long-Context Understanding and Retrieval in Video-Comprehension Tasks

Authors: H. Li et al.
Published on Arxiv: 2025-05-26
Link: http://arxiv.org/abs/2505.20444v1
Institutions: Carnegie Mellon University • Xiaohongshu Inc.
Keywords: Vision-Language Models, Rotary Position Embedding, Hybrid Position Embedding, Multimodal Transformers, Video Understanding, Dynamic Temporal Scaling, Long-Context Modeling, Qwen2, Video Retrieval, Semantic Modeling

Recent progress in Vision-Language Models (VLMs) has enabled significant advances in multimodal tasks. However, these models suffer marked performance drops in long-context scenarios such as understanding lengthy videos. Rotary Position Embedding (RoPE), while effective for large language models, struggles to capture the complex spatial-temporal dependencies that arise in such multimodal, video-centric settings. Moreover, current multimodal adaptations of RoPE typically rely on heuristic frequency allocation without solid theoretical grounding and remain limited in modeling long-range semantic relationships.
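For background, here is a minimal sketch of the standard text-only RoPE formulation that these multimodal adaptations build on. The notation (d for the head dimension, b for the frequency base, conventionally 10000) is generic and not taken from the paper:

```latex
% Standard 1-D RoPE (background sketch; b is the frequency base, d the head dimension)
\theta_i = b^{-2i/d}, \qquad i = 0, 1, \dots, \tfrac{d}{2} - 1
% Each 2-D slice (2i, 2i+1) of a query/key at token position m is rotated by the angle m * theta_i:
R_i(m) =
\begin{pmatrix}
  \cos(m\,\theta_i) & -\sin(m\,\theta_i) \\
  \sin(m\,\theta_i) & \cos(m\,\theta_i)
\end{pmatrix}
% so the attention score between rotated queries and keys, <R(m) q, R(n) k>,
% depends only on the relative offset m - n.
```

Multimodal RoPE variants extend this by partitioning the frequency indices across the temporal, height, and width coordinates of video tokens; the "heuristic frequency allocation" criticized above refers to how that partition is chosen, which is where, per the title and keywords, HoPE's hybrid strategy and dynamic temporal scaling come in.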

To address these challenges, the article introduces HoPE, a hybrid position embedding for VLMs, with its key contributions outlined as follows:

These methodological innovations yield the empirical results summarized below:

Building on these results, the article concludes with the following key takeaways: